Guiding discovery of protein sequence-structure-function modeling

Abstract Motivation Protein engineering techniques are key in designing novel catalysts for a wide range of reactions. Although approaches vary in their exploration of the sequence-structure-function paradigm, they are often hampered by the labor-intensive steps of protein expression and screening. In this work, we describe the development and testing of a high-throughput in silico sequence-structure-function pipeline using AlphaFold2 and fast Fourier transform docking that is benchmarked with enantioselectivity and reactivity predictions for an ancestral sequence library of fungal flavin-dependent monooxygenases. Results The predicted enantioselectivities and reactivities correlate well with previously described screens of an experimentally available subset of these proteins and capture known changes in enantioselectivity across the phylogenetic tree representing ancestorial proteins from this family. With this pipeline established as our functional screen, we apply ensemble decision tree models and explainable AI techniques to build sequence-function models and extract critical residues within the binding site and the second-sphere residues around this site. We demonstrate that the top-identified key residues in the control of enantioselectivity and reactivity correspond to experimentally verified residues. The in silico sequence-to-function pipeline serves as an accelerated framework to inform protein engineering efforts from vast informative sequence landscapes contained in protein families, ancestral resurrects, and directed evolution campaigns. Availability Jupyter notebooks detailing the sequence-structure-function pipeline are available at https://github.com/BrooksResearchGroup-UM/seq_struct_func


Grid-based Minimization
Grids using the same parameters as previously used for FFTDock 7 were generated with varying epsilon.The top 500 poses from FFTDock were minimized using 50 steps of SD and 1000 steps of ABNR with a TOLENR of .1.The reported docking energy was the final minimized grid energy subtracted by the initial energy of the ligand in vacuum.

Explicit Atom Minimization
Protein structures with FAD cofactor were used to minimize the 500 FFTDock 7 poses in vacuum, with CUTNB of 12 Å, CTOFNB of 10 Å, CTONNB of 10 Å, RDIE, varying epsilon, SWITCH, and VSWITCH.Protein and FAD atoms were fixed with CONS FIX.Minimization was performed with 50 steps of SD followed by 1000 steps of ABNR with TOLENR of .001.The reported docking energy was the final minimized ligand energy subtracted by the initial ligand energy.

Simulated Annealing
Docking based on the CHARMM simulated annealing protocol 7 was performed in vacuum with SWITCH, VSWITCH, CUTNB of 12 Å, CTOFNB of 10 Å, and CTONNB of 8 Å, and RDIE with varying epsilon.Grids were generated with the same protocol for FFTDock but varying softcore parameters.The soft grid used an EMAX of 3 kcal/mol, MINE of -20 kcal/mol, and MAXE of 40 kcal/mol.The mine grid used an EMAX of 15 kcal/mol, MINE of -120 kcal/mol, and MAXE of -2 kcal/mol.The hard grid used an EMAX of 100 kcal/mol, MINE of -100 kcal/mol, and MAXE of 100 kcal/mol.CHARMM OpenMM_dock 7 (OMMD) was used to carry out parallel simulated annealing on the 500 rotamers.OMMD grid was used to read in the mine grid and soft grid.The top 10 poses from FFTDock were used as starting conformers to generate 500 total rotamers for simulated annealing.The ligand selection and NUMCOPY of 500 were passed to OMMD build.Starting conformers were briefly minimized in vacuum with 50 steps of SD with TOLENR of 0.01 and 50 steps of ABNR with TOLENR of 0.005.50 rotamers were generated for each conformer pose for a total of 500 poses.Each rotamer was generated by random rotation of the starting conformer between 0 and 180 degrees, random translation between 0 and 2 Å, minimization with 100 steps of SD and 500 steps of ABNR with TOLENR of 0.01 in the soft grid, and lastly minimization in the hard grid using 50 steps of SD and 100 steps of conjugate gradient descent (CONJ) with TOLENR of 0.01.OMMD simulated annealing (SIAN) was carried out on the 500 rotamers with soft potentials, 3000 steps, starting temperature of 300 K, final temperature of 700 K, NHRQ 50, and INCT 1. Next, SIAN was carried out with soft potential, 14000 steps, starting temperature of 700K and final temperature of 300 K. SIAN was then carried out with hard potential, 7000 steps, starting temperature of 500 K and ending temperature of 300 K. Lastly, SIAN was performed with hard potential, 300 steps, starting temperature of 400K, final temperature of 50K.Final poses were minimized with 50 steps of SD and 1000 steps of ABNR and TOLENR of 0.001.

Figure S2 :
Figure S2: Histogram of average pLDDT 16 values across models.pLDDT values were averaged across all residues.

Figure S3 :
Figure S3: Anc 278 colored by predicted Local Distance Test (pLDDT 16 ) from AlphaFold2 10 .Red indicates low scoring regions and blue indicates high scoring regions.The FAD cofactor and docked ligand represent the binding site which posess more conserved residues.

Figure S4 :
Figure S4: Anc 278 pLDDT 16 per residue scores for AlphaFold2 10 pipeline conditioned on consensus MSA versus full AlphaFold pipeline.Both pipelines used monomer model 1 and default parameters.The mean pLDDT score for the consensus model was 65.7 and the mean pLDDT score for the full pipeline was 69.3.The Ca RMSD using TM-align 1 was 2.88 Å between the consensus pipeline and full pipeline Anc 278 structures.

Figure S5 :
Figure S5: Interaction energy between AlphaFold2 structures and FAD cofactor.Red line indicates energy of TropB.Top panel is total interaction energy in kcal/mol, middle panel depicts van der Waals hydrophobic energy, and the bottom panel electorstatic interaction energy.

Figure S6 :
Figure S6: Prediction of stereochemistry from docked structures.The red vector indicates the normal vector the plane of the ligand resorcinol ring defined by three atoms.The blue vector indicates the vector from the ligand resorcinol ring and anion average coordinates to the FAD C13 atom.The angle between the blue and red vectors (131.45 o ) is used to classify the stereochemistry of a pose as R or S, with an angle greater than 90 o as R, otherwise as S.

Figure S7 :
Figure S7: Docking methods MCC vs Accuracy for all protein ligand stereochemistry pairs."Prot" is minimization of FFTDock 7 poses in explicit protein."Annealing" is use of FFTDock poses as starting poses in simulated annealing."Grid" is minimization of FFTDock poses further in soft grid potentials.The rescoring approaches are named as follows: (method)_(method epsilon)_fftdock_(fftdock epsilon).All methods used FFTDock poses generated at epsilon of 3.

Figure S8 :
Figure S8: Accuracy and MCC for stereochemistry prediction against varying epsilon for protein minimization.Accuracy and MCC were calculated using all protein ligand pairs with experimental stereochemistry data.Protein minimization was performed by minimizing FFTDock 7 poses in explicit protein, with FFTDock poses generated at epsilon of 3.

Figure S9 :
Figure S9: Mean stereochemistry results for ancestral FDMO library.Blue squares represent R stereochemistry, yellow squares represent S stereochemistry, and greens squares represent racemic stereochemistry.The top panel indicates mean stereochemistry for previous experimental results.The middle panel indicates mean stereochemistry for protocol prediction for active enzymes.The bottom panel indicates mean stereochemistry for protocol predictions for all enzymes in ancestral FDMO library.

Figure S10 :
Figure S10: Predicted reactivity from logistic regression versus experimental conversion for the ancestral FDMO library.The top panel is the original conversion data for the experimental assay, with black indicating conversion near 100, and white representing 0 conversion.The middle panel is the consensus reactivity label assigned by any nonzero conversion with any ligand.The bottom panel is the predictions from consensus logistic regression model.

Figure S15 :
Figure S15: Effect of residue 239 mutations to ancestral FDMO library and full sequence library.a) Docking energy change between mutation and wildtype top docking pose for ancestral FDMO library.F239Y mutations are denoted in blue as they should promote R stereochemistry, and Y239F mutations are denoted in yellow as they should promote S stereochemistry.b) Predicted stereochemistry for ancestral FDMO library, blue squares denote R and yellow squares denote S, with green denoting R/S.c) Wildtype stereochemistries of full sequence library, with blue representing wildtype sequences before F239Y mutation, and orange representing wildtype sequences before Y239F mutation.d) Mutant stereochemistries of full sequence library.

Figure S16 :
Figure S16: Binding site and second shell residues included after dropping columns with more than 10% gaps represented in red in TropB, excluded residues in white.

Figure S17 :
Figure S17: Boxplot of default model performances from MLJAR 18 for stereochemistry prediction on wildtype sequences.

Figure S18 :
Figure S18: Boxplot of default model performances from MLJAR 18 for conversion prediction on wildtype sequences.

Figure S19 :
Figure S19: Boxplot of tuned model performances from MLJAR 18 for stereochemistry prediction on wildtype library.

Figure S20 :
Figure S20: Boxplot of tuned model performances from MLJAR 18 for reactivity prediction on wildtype library.

Figure S21 :
Figure S21: Heatmap of mean SHAP 19 values by residue for every fold of every model for stereochemistry prediction of wildtype sequences.

Figure S22 :
Figure S22: Bar plot of mean SHAP 19 values averaged across all folds/models for stereochemistry prediction of wildtype sequences.

Figure S23 :
Figure S23: Heatmap of mean SHAP 19 values by residues for every fold of every model for reactivity prediction of wildtype sequences.

Figure S24 :
Figure S24: Bar plot of mean SHAP 19 values averaged across all folds/models for reactivity prediction of wildtype sequences

Table S1 :
Comparison of different protein structure prediction methods for TropB.AlphaFold2_cons indicates using MSA generation preconditioned on consensus sequence hits with AlphaFold2 pipeline.AlphaFold2_dummy indicates using dummy templates instead of PDB template hits.

Table S3 :
Logistic regression model parameters.Average Pafnucy 17 pKd efficiency is an order of magnitude smaller than average docking energy efficiency, therefore coefficients between the two approaches are comparable.